Assays that measure steroid hormones in patient care, public health, and research need to be both accurate and precise, as these 
criteria help to ensure comparability across all clinical and research applications. This review addresses major issues relevant 
to assay variability and describes recent activities by the US Centers for Disease Control and Prevention (CDC) to improve assay 
performance. Currently, high degrees of accuracy and precision are not always met for testosterone and estradiol measurements; 
although technologies for steroid hormone measurement have advanced significantly, measurement variability within and across 
laboratories has not improved accordingly. Differences in calibration and specificity are discussed as sources of variability in 
measurement accuracy. Ultimately, a combination of factors appears to cause inaccuracy of steroid hormone measurements, with 
nonuniform assay calibration and lack of specificity being two major contributors to assay variability. Within-assay variability for 
current assays is generally high, especially at low analyte concentrations. The CDC Hormone Standardization (HoSt) Program is 
improving clinical assays, as evidenced by a 50% decline in mean absolute bias between mass spectrometry assays and the CDC 
reference method from 2007 to 2011. This program provides the measurement traceability to CDC reference methods and helps 
to minimize factors affecting measurement variability. 
INTRODUCTION 

Our understanding of the relevance of steroid hormones 
in health and disease has advanced substantially, 
most notably for testosterone and estradiol. Steroid hormones 
were once thought to be responsible only for 
the development of secondary sexual characteristics; it is now 
understood that their significance is much broader, 
as exemplified by their influence over a wide range of biological 
functions in organs such as bone and in the cardiovascular 
system. Steroid hormone levels that are excessive or deficient 
are associated with numerous diseases and chronic conditions such 
as hypogonadism, polycystic ovary syndrome (PCOS), osteoporosis, 
metabolic syndrome, diabetes, cancer, and cardiovascular 
diseases. The potential impact of testosterone deficiency on public 
health was estimated in one study, which suggested that, over 
a 20-year period, testosterone deficiency would be involved in the 
development of approximately 1.3 million new cases of cardiovascular 
diseases, 1.1 million new cases of diabetes mellitus, and over 600 000 
osteoporosis-related fractures. The total cost of evaluating and 
providing care to women with PCOS in the United States was estimated 
in 2005 to be US$4.36 billion. 

The increased awareness about the importance of steroid hormones 
relative to human health has stimulated further research efforts, the 
results of which have already led to new guidelines for patient care. 
Many of these new research findings suggest that even relatively small 
changes in steroid hormone levels can cause changes in health status 
and augment disease risk. It is therefore crucial that hormone levels 
be detected and measured accurately, even at low concentrations. 
The reliable detection of very low levels of hormones in blood is 
also required by some treatments, such as those using aromatase 
inhibitors. Assays for measuring steroid hormones in routine patient 
care, public health, and research must meet the aforementioned needs 
and requirements. 

Detection of testosterone and estradiol in serum necessitates 
assays that accurately measure these compounds over a wide 
concentration range, which spans from less than 0.1 nmol l⁻¹ to 
over 30 nmol l⁻¹ for testosterone, and from less than 4 pmol l⁻¹ to 
over 10 000 pmol l⁻¹ for estradiol. Moreover, such assays should be 
sufficiently precise to enable the detection of biologically relevant 
differences in hormone levels, such as the detection of androgen 
excess in premenopausal women, elevation in postmenopausal 
women, and deficiency in men. They must also maintain a high level 
of sensitivity to assure reliable detection of hormone levels in children, 
to monitor patients on aromatase inhibitor therapy, and at the other 
extreme, to quantitate the hyperstimulation of steroidogenesis 
due to gonadotropin-releasing hormone agonist therapy. Finally, 
these assays must be specific enough to measure only the analyte 
of interest so as to allow the correct assessment of patient status 
and treatment response. This can be very challenging, especially 
considering that more than 100 conjugated and unconjugated 
estradiol metabolites and exogenous hormones can occur in blood 
and potentially interfere with the measurement. 

The requirements outlined above for steroid hormone assays 
are not met for all clinical and research applications, as has been 
noted by professional organizations and experts alike. Even 
a clinical guideline that acknowledges the benefits of measuring 
steroid hormone levels for comprehensive patient care nevertheless 
recommends against testing because reliable analytical methods 
are lacking. Currently, the lack of accurate 
and comparable estradiol measurements prevents the implementation 
of research findings in patient care. For example, certain clinical and 
epidemiological studies concluded that postmenopausal women 
with elevated levels of estradiol are at increased risk of breast cancer; 
however, a generally accepted, specific estradiol concentration with 
which to categorize women of increased risk cannot be defined. 
Because the measurements are not accurate, the concentrations that 
may be considered elevated are not comparable across studies, and 
thus, women at increased risk of breast cancer cannot be identified 
using current estradiol tests. 

Overcoming these challenges requires ongoing programs to 
continuously assess the analytical performance of assays, to identify 
potential problems, and to properly address their causes. Furthermore, 
these programs are needed to maintain the quality of these tests. This 
review discusses major factors affecting estradiol and testosterone 
measurements and describes activities by the US Centers for Disease 
Control and Prevention (CDC) to improve measurement reliability. 

ASSAY TECHNOLOGIES FOR MEASURING STEROID HORMONES 
IN PATIENT CARE 

Initially, steroid hormone measurements were performed using 
radioimmunoassays (RIAs). RIAs include steps to isolate hormones 
from serum before the actual RIA measurement, and these steps 
typically consist of some form of chromatography. Because the actual 
hormone measurement is performed after isolation of the analyte, 
these assays are called 'indirect' assays. To reduce cost, increase throughput, 
and lower the overall complexity of operating these indirect assays, so-called 
'direct' assays were developed that measure steroid hormones directly 
in serum. These direct RIAs were further simplified by replacing 
radioisotope measurements with other techniques. Currently, almost 
all immunological testosterone and estradiol methods performed in 
patient care are direct assays. 

Analytical methods for measuring steroid hormones by using 
mass spectrometry have been in use since the 1960s. Similar to the 
indirect immunoassays, these methods include a step to isolate the 
steroid hormones from serum. After isolation, the steroid hormones 
are derivatized to enable further chromatographic separation of isomers 
and other structurally similar compounds by gas chromatography. 
After chromatographic separation, the steroid hormones are detected 
based on their specific mass using mass spectrometry. These methods 
are highly specific and can provide highly accurate measurements of 
multiple steroid hormones simultaneously. Because of their complexity, 
low throughput, and high specimen volume requirements, these 
methods were used only as reference methods and to address specific 
research questions. Advancements in mass spectrometry enabled the 
use of liquid chromatography instead of gas chromatography, and the 
introduction of tandem mass spectrometry increased the detection 
specificity and sensitivity. Other technical developments enabled 
automation of sample preparation and analysis. As a consequence, 
liquid chromatography coupled with tandem mass spectrometry is 
increasingly used in patient care, public health, and research. 



Testosterone circulates in the blood either tightly bound to sex 
hormone binding globulin (SHBG), loosely bound to albumin, or 
unbound (free). Testosterone bound to SHBG is considered unavailable 
to target tissues, and only free and albumin-bound testosterone are 
considered bioavailable. For many clinical applications, measurement 
of total testosterone can be considered adequate for the evaluation of a 
patient. However, certain conditions, such as those where SHBG levels 
might be altered, may require measurement of free or bioavailable 
testosterone. 

Free testosterone is traditionally estimated using equilibrium 
dialysis. With this method, serum is dialyzed across a semipermeable 
membrane that retains all protein-bound hormones, and the 
testosterone that crosses the membrane is measured and considered 
to represent free, unbound testosterone. Other methods employ 
ultrafiltration to separate unbound and SHBG-bound hormones. 
These methods can be very precise but are time-consuming, and they 
are difficult to operate and automate. To overcome these challenges, 
so-called 'analog-based assays' were developed, which intend to 
measure free hormone directly in serum. However, concerns have 
been raised about the reliability of analog-based assays, leading to 
the recommendation not to use analog-based assays for determining 
free hormones. 

The profound technical improvements in steroid hormone 
measurement technologies resulted in a wide variety of methods 
available to physicians, researchers, and public health professionals. 
However, these technical improvements did not result in profound 
improvements in patient care. A large body of patient data and public 
health information is created with these different methods, as reflected 
in the increasing number of scientific publications, clinical studies, 
and increased demand for tests in patient care; however, generally 
accepted reference ranges, clinical decision levels, and guidelines 
for patient care do not exist or cannot be applied in most cases. One 
reason for this situation is the variability and lack of comparability of 
measurement results among different methods, laboratories, and over 
time. The variability in measurement results of testosterone and estradiol 
assays has been known for many years, and it can be attributed 
mainly to differences in assay accuracy, specificity, and imprecision. 

ASSESSMENT OF FACTORS CONTRIBUTING TO VARIABILITY 
IN MEASUREMENT ACCURACY 

Variability in measurement accuracy, defined as closeness of agreement 
between a measured value and its true value, can be caused by 
inconsistent or inaccurate calibration or by interfering compounds. 

Studies comparing measurement results for testosterone and 
estradiol obtained with immunoassays against those measurement 
results obtained with mass spectrometric assays found high 
disagreement between the assays. In most studies, the absolute 
measurement bias between an assay and the comparison method was 
highest at low analyte concentrations and could exceed 100% for 
individual samples. The patterns of individual sample bias over the 
investigated concentration ranges differed among assays (Figure 1). 
One comparison study of commercial immunoassays for testosterone 
against a mass spectrometry method reported a reasonable overall 
agreement (R² > 0.92) at testosterone concentrations of >4 nmol l⁻¹ 
and a mean bias ranging between 35% and 50%. The correlation was 
weaker (R², 0.59-0.97) at concentrations of <4 nmol l⁻¹, where the 
mean bias ranged from 5% to 220%. Another study evaluated an 
immunoassay standardized for testosterone by the CDC Hormone 
Standardization (HoSt) Program against a mass spectrometric assay 
using samples from 3174 adult males (testosterone levels: >8 nmol l⁻¹). 

Figure 1: Differences in serum testosterone levels between commercial 
immunoassays and a liquid chromatography tandem mass spectrometry 
method of individual patient samples. Adapted with permission from Wang 
et al. 



The authors found good agreement (R² = 0.93) between the two assays, 
with the 95% confidence interval (CI) of the slope and intercept of the 
regression plot ranging from 0.96 to 0.99 and 0.18 to 0.65 nmol l⁻¹, 
respectively. The mean bias was 0.77%. No notable differences in the 
population distribution were observed when comparing data from 
both assays. However, the estradiol measurements 
performed in this population differed notably, with the 95% CI of the 
slope and intercept ranging from 1.15 to 1.25 and 0.75 to 7.81 pmol l⁻¹, 
respectively. The mean bias was 30.1%. The authors concluded that this 
immunoassay provides reliable data for testosterone in normal men, 
while it is unreliable for estradiol in men. 

Mass spectrometric methods are frequently used as a comparison 
method, because they are considered more specific than direct 
immunoassays. However, studies investigating the variability among 
mass spectrometry methods used in research and patient care found 
these methods to be inaccurate as well; relative to immunoassays, the 
mass spectrometric methods presented similar bias patterns, albeit 
much less pronounced. Therefore, assessment and improvement of 
measurement accuracy need to include all assay technologies. 

The different bias patterns among assays and the changes in bias 
with the analyte concentration within assays suggest that different 
factors affect the measurement accuracy of an assay and that each assay 
is affected differently by these factors. 

VARIABILITY CAUSED BY DIFFERENCES IN MEASUREMENT 
SPECIFICITY 

Typically, increasing positive bias with decreasing analyte 
concentration (Figure 1b) suggests that the assay is measuring other 
compounds in addition to the analyte and lacks specificity. Certain 
medications and other steroids such as estriol or dehydroepiandrosterone 
sulfate are known to potentially interfere with steroid hormone 
measurements. Improvement in measurement accuracy 
has been shown upon introduction of liquid-liquid extraction or 
chromatography that removes compounds with similar characteristics 
from the sample prior to the actual measurement, especially at the low 
analyte concentrations typically observed for testosterone in women 
and estradiol in men and postmenopausal women. 

Increasing negative bias with decreasing analyte concentration 
(Figure 1d) could be explained by incomplete dissociation of the 
hormone from the SHBG prior to measurement. Also, interfering 
compounds with higher affinity for the antibody relative to the analyte's 
affinity may cause incomplete recovery of the analyte, resulting in a 
negative bias; interfering compounds with low cross-reactivity but 
present at high concentration levels may also result in incomplete 
recovery. 

Measurement specificity can be assessed by determining 
the measurement accuracy after adding potentially interfering 
compounds to patient samples. However, the compounds interfering 
with the measurement are frequently not known. Therefore, 
measuring panels of individual patient samples with target values 
assigned by a reference measurement procedure can help to identify 
inaccuracies caused by interfering compounds as described later in 
this review. 

VARIABILITY CAUSED BY DIFFERENCES IN MEASUREMENT 
CALIBRATION 

Measurement bias that is constant and independent of the analyte 
concentration is typically caused by differences in calibration and is 
expressed as mean bias calculated from a panel of samples spanning 
the reportable range of the assay. 
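
As an illustration of how such a mean bias might be computed, the following minimal sketch calculates the percent bias of each sample against its target value and then averages over the panel. The function names and the panel values are hypothetical and are not taken from any cited study; the mean absolute bias is included because positive and negative sample biases can cancel in the plain mean.

```python
from statistics import mean

def percent_bias(measured, target):
    """Relative difference of a measured value from its target value, in percent."""
    return (measured - target) / target * 100.0

def mean_bias(pairs):
    """Mean percent bias over a panel of (measured, target) pairs."""
    return mean(percent_bias(m, t) for m, t in pairs)

def mean_absolute_bias(pairs):
    """Mean of the absolute percent biases; signs do not cancel here."""
    return mean(abs(percent_bias(m, t)) for m, t in pairs)

# Hypothetical panel spanning a reportable range (values in nmol/l).
# Every sample reads about 10% high, the pattern expected from a
# calibration difference rather than from an interfering compound.
panel = [(1.1, 1.0), (5.5, 5.0), (11.0, 10.0), (33.0, 30.0)]

print(f"mean bias: {mean_bias(panel):.1f}%")
print(f"mean absolute bias: {mean_absolute_bias(panel):.1f}%")
```
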

Calibration bias can be minimized by use of a common calibrator 
or reference material and by following ISO standard 17511 for 
calibration and target value assignment. Pure compound, certified 
reference materials for testosterone and estradiol are available from the 
National Measurement Institute in Australia and the National Institute 
of Advanced Industrial Science and Technology in Japan, respectively. 
Many clinical immunoassays are optimized for measurements in serum 
and cannot use pure compound materials as calibrators. Serum-based 
reference materials created from pooled sera are available from the 
National Institute of Standards and Technology (NIST) for testosterone 
and the Institute for Reference Materials and Measurements (IRMM) 
for estradiol. Furthermore, panels of sera from individual donors with 
values for testosterone and estradiol assigned by a reference method 
are available from the CDC. The target values assigned to these 
serum-based reference materials are traceable to the certified, pure 
compound reference materials in line with requirements described in 
ISO standard 17511. 

The use of a common calibrator and the establishment of metrological 
traceability as described in ISO 17511 do not in themselves assure accurate 
measurements in patients. The process of calibration and assigning 
target values can introduce measurement bias if not conducted 
carefully. Non-commutable reference materials applied as calibrators 
or for calibration verification can result in or suggest measurement 
bias, especially with direct immunoassays optimized for use in 
patient care. The importance of commutability of materials used for 
calibration was emphasized in several reviews and commentaries. 
Commutability is a material characteristic that describes how well 
measurement results obtained with a reference material correlate with 
measurement results obtained with patient samples. Protocols for 
assessing commutability of materials were established by the Clinical 
and Laboratory Standards Institute (CLSI). An alternate approach 
that does not require commutability assessments employs 
single-donor patient samples for calibration or calibration verification. 
By using panels of patient samples for bias assessment rather than a few 
pooled materials, the likelihood that an interfering compound masks 
a calibration bias can be minimized. 

Sera depleted of testosterone or estradiol using charcoal 
('charcoal-stripped' sera) are frequently used as matrices for preparing calibrator 
solutions, especially with mass spectrometric assays. Remaining 
charcoal fines in such modified sera can adsorb testosterone or 
estradiol and thus lead to inaccurate calibrator solutions. Therefore, 
the accuracy of calibrator solutions needs to be assured before they 
can be used as calibrators. 

Serial dilution of calibrators can result in propagation of dilution 
errors and in inaccurate calibration solutions. Minimizing serial 
dilutions and appropriately correcting for any dilution error can 
minimize this source of variability. 
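
The way a small error can propagate through a dilution series can be sketched numerically. The example below assumes, purely for illustration, a constant systematic relative error in the amount of analyte transferred at every step and ignores the accompanying change in total volume; the function name, the stock concentration, and the 2% error are hypothetical.

```python
def serial_dilution(stock, factor, steps, transfer_error=0.0):
    """Concentrations after each step of a serial dilution.

    transfer_error is a hypothetical systematic relative error per step
    (0.02 means 2% too much analyte is carried into the next dilution).
    """
    concentrations = []
    current = stock
    for _ in range(steps):
        current = current * (1.0 + transfer_error) / factor
        concentrations.append(current)
    return concentrations

# Four 1:10 dilutions of a hypothetical stock solution.
nominal = serial_dilution(1000.0, 10.0, 4)
actual = serial_dilution(1000.0, 10.0, 4, transfer_error=0.02)

for step, (nom, act) in enumerate(zip(nominal, actual), start=1):
    deviation = (act - nom) / nom * 100.0
    print(f"step {step}: deviation from nominal {deviation:.2f}%")
```

Under this assumption the deviation after n steps is (1 + e)^n - 1, so a 2% error per step grows to about 8.2% at the fourth and lowest calibrator, the concentration region where assay accuracy is already most difficult to achieve.
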

VARIABILITY IN PRECISION 

Measurement precision, defined as the closeness of agreement between 
independent test results obtained on the same sample and with the 
same assay, depends mainly on the robustness of the measurement 
procedure and instrumentation used with the procedure. The assay 
precision is typically described by the imprecision and expressed as 
percent coefficient of variation (CV). 
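
For illustration, the percent CV of a set of replicate measurements can be computed as the sample standard deviation divided by the mean. The sketch below uses hypothetical replicate values and the sample (n - 1) standard deviation, which is an assumption about the estimator rather than a statement about any particular assay evaluation.

```python
from statistics import mean, stdev

def percent_cv(replicates):
    """Percent coefficient of variation of replicate results on one sample."""
    return stdev(replicates) / mean(replicates) * 100.0

# Hypothetical replicate testosterone results on the same serum (nmol/l).
replicates = [2.5, 2.7, 2.9]
print(f"CV: {percent_cv(replicates):.1f}%")
```
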

The within-assay imprecision of estradiol and testosterone assays 
is very inconsistent at low analyte concentrations. The imprecision 
of commercial immunoassays for testosterone was found to 
be <8% at concentrations >24 nmol l⁻¹ and between 6.1% and 22% 
at concentrations <3 nmol l⁻¹. The imprecision of commercial 
immunoassays for estradiol was reported to range from 1.2% to 42.6% 
CV at concentrations from 18 to 846 pg ml⁻¹ (66-3105 pmol l⁻¹). At an 
estradiol concentration of 18 pg ml⁻¹ (66 pmol l⁻¹), the imprecision was 
very inconsistent among these assays, and it became more uniform at 
levels of 93 pg ml⁻¹, with CVs ranging from 2.5% to 9.4%. The 
reasons for these differences in within-assay imprecision are not fully 
understood. One explanation could be that the analytical system has 
difficulty distinguishing a true analyte signal from background noise 
as the concentration approaches zero. 
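
The paired mass and molar units quoted for estradiol are related through the molar mass of the hormone. The following sketch shows the conversion; the molar masses (about 272.38 g/mol for estradiol and 288.42 g/mol for testosterone) are standard values that are not stated in the text, and the function name is hypothetical.

```python
# Molar masses in g/mol (standard values, not given in the text above).
MOLAR_MASS = {"estradiol": 272.38, "testosterone": 288.42}

def pg_per_ml_to_pmol_per_l(value, hormone):
    """Convert a mass concentration (pg/ml) to a molar concentration (pmol/l).

    1 pg/ml equals 1 ng/l, and dividing ng/l by g/mol gives nmol/l,
    hence the factor of 1000 to arrive at pmol/l.
    """
    return value * 1000.0 / MOLAR_MASS[hormone]

for pg_ml in (18, 93, 846):
    pmol_l = pg_per_ml_to_pmol_per_l(pg_ml, "estradiol")
    print(f"{pg_ml} pg/ml estradiol is about {pmol_l:.0f} pmol/l")
```

Small differences in the last digit relative to published values can arise from rounding of the conversion factor.
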

One study investigated the among-assay precision by comparing 
measurement results from the College of American Pathologists 
Survey (Y Ligand Special, September 2004), where the same sample was 
measured by different laboratories using the same testosterone assay. 
This study found that three out of nine assays were unable to produce 
peer group CVs lower than 10% at 12.8 nmol l⁻¹, a concentration level 
typically observed in men. Not a single assay was able to produce a 
peer group CV below 10% at 2.7 nmol l⁻¹, a concentration level typically 
observed in women. These findings were consistent with 
data from other external quality assessment/proficiency testing (EQA/PT) 
programs. The authors pointed out that data from EQA/PT 
programs may also reflect other sources of variability, such as potential 
data entry errors, reagent lot-to-lot variability, and specimen stability. 
Despite these limitations, the high imprecision among laboratories 
using the same assay warrants further investigation. 

Mass spectrometric methods employ different types of sample 
preparation to isolate hormones from the serum matrix. Typical 
procedures are solid-phase extraction and liquid-liquid extraction, 
either with or without prior protein precipitation. One study 
investigated the impact of different sample preparation procedures 
on imprecision and data distribution. The researchers found that the 
different procedures can have different precisions. The authors assumed 
these observations to be related to differences in analyte recovery and 
removal of the hormone from SHBG. 

THE CDC HOST PROGRAM TO IMPROVE TESTOSTERONE AND 
ESTRADIOL MEASUREMENTS 

The CDC HoSt Program helps in establishing traceability of 
measurements performed in patient care to a common reference 
material by assigning target values to individual donor sera using 
its reference methods. These reference methods are calibrated with 
pure compound, certified reference materials. The use of single donor 
sera for calibration or calibration verification overcomes potential 
commutability issues frequently observed with pooled or otherwise 
modified sera. Aligning assays to a common reference material is 
accomplished in phase 1 of the CDC Hormone Standardization 
Program, where 40 samples along with target values are provided to 
participants for calibration or calibration verification (Figure 2). 

To assure the established calibration remains accurate and 
consistent over time, the participant is challenged with 10 samples 
per calendar quarter with analyte concentrations unknown to the 
participant; evaluation of measurement bias is accomplished using 
data from four consecutive challenges. Testosterone assays with a mean 
bias within ±6.4% are considered sufficiently accurate and standardized. 
This bias criterion was reviewed and suggested by an expert group of 
the Partnership for the Accurate Testing of Hormones (PATH). All 
participants were asked to perform replicate measurements on the 
same samples, which allowed for the assessment of assay imprecision 
as well as bias. 
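
A simplified sketch of such an evaluation is given below. It pools the percent biases from four challenges of 10 samples each and compares the mean against the ±6.4% criterion for testosterone. This is an illustrative assumption only: the program's actual evaluation follows a CLSI protocol, and the function name and all sample values here are hypothetical.

```python
from statistics import mean

BIAS_LIMIT_PERCENT = 6.4  # testosterone criterion stated in the text

def evaluate_challenges(challenges, limit=BIAS_LIMIT_PERCENT):
    """Pool (measured, target) pairs from all challenges.

    Returns the mean percent bias and whether it lies within +/- limit.
    """
    biases = [
        (measured - target) / target * 100.0
        for challenge in challenges
        for measured, target in challenge
    ]
    mean_bias = mean(biases)
    return mean_bias, abs(mean_bias) <= limit

# Hypothetical data: 10 target values (nmol/l) per quarter, with a
# different proportional calibration offset in each quarter.
targets = [0.5, 1.0, 2.0, 4.0, 8.0, 12.0, 16.0, 20.0, 25.0, 30.0]
quarterly_factors = [1.02, 1.05, 0.99, 1.04]
challenges = [[(t * f, t) for t in targets] for f in quarterly_factors]

mean_bias, passed = evaluate_challenges(challenges)
print(f"mean bias over {sum(len(c) for c in challenges)} samples: {mean_bias:.1f}%")
print("within criterion" if passed else "outside criterion")
```
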

Measurement performance — specifically measurement accuracy — 
can change over time as new batches of calibrators and reagents are 
prepared and as instrumentation and other measurement conditions 
change. Interlaboratory comparison studies provide information 
about assay performance at one point in time, but they cannot detect 
changes in variability over time. Furthermore, results from different 
interlaboratory comparison studies performed with the same assays 
may not be comparable because of differences in study designs such 
as type and concentration range of samples used. 



Figure 2: The US Centers for Disease Control and Prevention Hormone 
Standardization Program scheme for assessing assay calibration (Phase 1) 
and certifying assay accuracy (Phase 2) using single-donor serum samples. 
Phase 1: 40 single-donor serum samples with assigned values are used for 
calibration or calibration verification. Phase 2: 10 blinded single-donor 
serum samples are measured per quarterly challenge (Challenges 1 to 4); bias 
is estimated using all four consecutive challenges, and performance is 
evaluated using CLSI EP09-A2 with a defined bias criterion derived from 
biological variability data. 

The CDC HoSt Program evaluates measurement accuracy 
over time through its quarterly challenges. The specimens used in this 
program are collected, processed, and shipped using standardized 
protocols. Reference values are assigned to these sera using reference 
methods recognized by the Joint Committee for Traceability in Laboratory 
Medicine, and the measurement bias is assessed following an established 
protocol from the CLSI. This allows the CDC HoSt Program to 
consistently and reliably assess measurement performance over time. 

Figure 3 shows the measurement bias observed with participants 
in the CDC HoSt Program over 2.5 years (10 quarterly challenges) for 
testosterone. The measurement bias for participant A is highly variable 
among the first quarterly challenges, with bias ranging between 10% 
and 20%. Using the information obtained from quarterly challenges, 
the participant was able to identify the sources of variability and 
to minimize measurement bias, which is reflected in the small 
measurement bias observed in the last three quarters. 

With participant B, the individual sample biases obtained from 
male donor samples appear to be consistent within ±10% over 2.5 years. 
However, the samples from female donors consistently have high 
measurement biases of up to ±100% difference from the target value. 
These data suggest that the assay is calibrated consistently over time 
but seems to be subject to interferences that cause high measurement 
bias in female samples. 

Figure 3: Measurement bias among three assays by HoSt participants (a, b, c) 
and the US Centers for Disease Control and Prevention reference laboratory 
on individual HoSt samples measured over 10 quarters. 

The individual sample bias observed with participant C on sera 
from male and female donors appears mostly consistent within 
each quarter, suggesting that the measurement accuracy does not 
change with the testosterone concentration or gender of the serum 
donor, and thus that the assay has high specificity. However, the 
mean bias from each set of quarterly samples seems to change 
over time, ranging between 20 and 10%. Because the assay shows 
a high degree of specificity, it can be assumed that the observed 
variability in measurement bias is caused mainly by variability in 
calibration. Further investigation revealed that changes in in-house 
prepared calibrator batches coincided with changes in measurement 
accuracy. 

These observations demonstrate that assay accuracy can change 
over time and is best monitored using individual donor samples that 
cover the reportable range of the assays. Furthermore, the use of 
single-donor serum samples can help distinguish between calibration 
bias and bias due to interfering substances. This greatly facilitates 
identifying and addressing the source of measurement bias and 
variability among assays. 
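
The distinction between the two bias patterns can be sketched with a simple heuristic that compares the mean percent bias in the lower and upper halves of the concentration range of a panel. The function name, the 5 percentage point tolerance, and the sample values are all hypothetical; this is an illustration of the reasoning, not the evaluation procedure used by the program, and it assumes that the panel contains more than one distinct target value.

```python
from statistics import mean, median

def bias_pattern(samples, tolerance=5.0):
    """Classify the bias pattern of a panel of (target, measured) pairs.

    tolerance is a hypothetical threshold in percentage points.
    """
    biases = [(t, (m - t) / t * 100.0) for t, m in samples]
    cut = median(t for t, _ in biases)
    low = mean(b for t, b in biases if t <= cut)
    high = mean(b for t, b in biases if t > cut)
    if abs(low - high) > tolerance:
        return "concentration-dependent bias (possible interference or incomplete recovery)"
    if abs(mean(b for _, b in biases)) > tolerance:
        return "constant bias (possible calibration difference)"
    return "no notable bias"

targets = [1.0, 2.0, 10.0, 20.0]  # hypothetical target values in nmol/l

# Proportional offset: every sample reads 10% high.
calibration_like = [(t, t * 1.10) for t in targets]
# Additive offset of 0.5 nmol/l: relative bias grows as concentration falls.
interference_like = [(t, t + 0.5) for t in targets]

print(bias_pattern(calibration_like))
print(bias_pattern(interference_like))
```
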

The impact of the CDC HoSt Program is reflected in improvements 
in measurement accuracy among mass spectrometric methods for 
testosterone (Figure 4). The among-laboratory variability for the mass 
spectrometric methods, expressed as the mean absolute bias calculated 
from the bias observed with different assays on individual samples, 
declined by approximately 50% across samples from 2007 to 2011. 
Improvements with immunoassays are anticipated. 

SUMMARY 

Clinical measurements of testosterone and estradiol are important 
for the diagnosis, treatment, and prevention of many diseases. 
Currently, the analytical performance of individual assays may 
not be appropriate to meet the needs of all clinical applications. 
This is especially true with regard to measurements of low analyte 
concentrations, such as those for testosterone observed in women and 
estradiol in men and postmenopausal women; accuracy in measuring 
low analyte concentrations remains challenging for most assays. 
Therefore, the reliability of a particular assay must be assessed in 
light of the intended use, and it cannot be generalized to all potential 
applications. The CDC HoSt Program is improving clinical assays 
by providing traceability of measurements to the highest available 
reference and by helping laboratories to understand factors affecting 
measurement variability. Further research and improvements are 
needed to minimize measurement variability and assure appropriate 
patient care. 